    Cmd 1

    Human Activity Recognition Using Smart Device Data (Phone/Watch)

    Cmd 2
    # The applied options are for CSV files. For other file types, these will be ignored.
    col = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z']
    raw_par_10_phone_accel = spark.read.format("csv") \
                 .option("header", "false") \
                 .option("inferSchema", "true") \
                 .option("delimiter", ",") \
                 .load("s3://humanactivity/wisdm-dataset/raw/phone/accel/data_1610_accel_phone.txt") \
                 .toDF(*col)
    
    display(raw_par_10_phone_accel)

    (3) Spark Jobs

    participant_id | activity_code | timestamp      | x           | y            | z
    1610           | A             | 18687441561967 | 1.1749573   | 13.347473    | -4.0346375;
    1610           | A             | 18687491915971 | 1.4081879   | 7.091858     | -3.8957214;
    1610           | A             | 18687542269974 | 4.9325104   | 6.3068085    | -2.3390045;
    1610           | A             | 18687592623978 | 0.15464783  | 6.1235046    | -1.8314667;
    1610           | A             | 18687642977982 | -2.8260345  | 4.180542     | -3.2118988;
    1610           | A             | 18687693331986 | -4.2210693  | 6.18602      | 3.352295;
    1610           | A             | 18687743685990 | 8.807434    | 14.673569    | 0.69750977;
    1610           | A             | 18687794039994 | 9.771988    | 13.823517    | -3.5910187;
    1610           | A             | 18687844393998 | 2.1411438   | 10.951889    | -3.8944092;
    1610           | A             | 18687894748002 | 0.08996582  | 5.83667      | -4.4335938;
    1610           | A             | 18687945102006 | -0.6851654  | 5.2144165    | -2.6980743;
    1610           | A             | 18687995456010 | 3.0335083   | 14.796036    | -2.8624268;
    1610           | A             | 18688045810013 | 4.3611145   | 10.702621    | -2.3544464;
    1610           | A             | 18688096164017 | 3.7458954   | 4.4772034    | -0.70654297;
    1610           | A             | 18688146518021 | 6.263504    | 6.0657043    | -1.1902008;
    1610           | A             | 18688196872025 | 5.9160004   | 8.954636     | -1.4206085;
    1610           | A             | 18688247226029 | 3.7761688   | 7.9741364    | -1.6654816;
    10,000 rows | Truncated data
    Command took 4.22 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:27 on vijay pawar's Cluster
    Cmd 3

    EDA

    Cmd 4
    raw_par_10_phone_accel.show(5)

    (1) Spark Jobs

    +--------------+-------------+--------------+----------+---------+-----------+
    |participant_id|activity_code|     timestamp|         x|        y|          z|
    +--------------+-------------+--------------+----------+---------+-----------+
    |          1610|            A|18687441561967| 1.1749573|13.347473|-4.0346375;|
    |          1610|            A|18687491915971| 1.4081879| 7.091858|-3.8957214;|
    |          1610|            A|18687542269974| 4.9325104|6.3068085|-2.3390045;|
    |          1610|            A|18687592623978|0.15464783|6.1235046|-1.8314667;|
    |          1610|            A|18687642977982|-2.8260345| 4.180542|-3.2118988;|
    +--------------+-------------+--------------+----------+---------+-----------+
    only showing top 5 rows
    Command took 0.36 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:30 on vijay pawar's Cluster
    Cmd 5
    raw_par_10_phone_accel.dtypes
    Out[89]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'bigint'), ('x', 'double'), ('y', 'double'), ('z', 'string')]
    Command took 0.08 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:35 on vijay pawar's Cluster
    Cmd 6
    from pyspark.sql.functions import regexp_replace
    raw_par_10_phone_accel=raw_par_10_phone_accel.withColumn('z',regexp_replace('z',';',''))
    Command took 0.10 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:39 on vijay pawar's Cluster
    Cmd 7
    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType
    
    
    Command took 0.10 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:05:45 on vijay pawar's Cluster
    Cmd 8
    from pyspark.sql.types import DoubleType
    raw_par_10_phone_accel = raw_par_10_phone_accel.withColumn("z", col("z").cast(DoubleType()))
    Command took 0.15 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:45 on vijay pawar's Cluster
    Cmd 9
    raw_par_10_phone_accel.dtypes
    Out[93]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'bigint'), ('x', 'double'), ('y', 'double'), ('z', 'double')]
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:17:49 on vijay pawar's Cluster
    Cmd 10
    from pyspark.sql.functions import unix_timestamp
    raw_par_10_phone_accel = raw_par_10_phone_accel.withColumn('timestamp', col('timestamp').cast('timestamp'))
    Command took 0.10 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 23:39:27 on vijay pawar's Cluster
    Cmd 11
    raw_par_10_phone_accel.dtypes
    Out[78]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'timestamp'), ('x', 'double'), ('y', 'double'), ('z', 'double')]
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 23:39:41 on vijay pawar's Cluster
    Cmd 12
    raw_par_10_phone_accel.show(5)

    (1) Spark Jobs

    +--------------+-------------+--------------+----------+---------+----------+
    |participant_id|activity_code|     timestamp|         x|        y|         z|
    +--------------+-------------+--------------+----------+---------+----------+
    |          1610|            A|18687441561967| 1.1749573|13.347473|-4.0346375|
    |          1610|            A|18687491915971| 1.4081879| 7.091858|-3.8957214|
    |          1610|            A|18687542269974| 4.9325104|6.3068085|-2.3390045|
    |          1610|            A|18687592623978|0.15464783|6.1235046|-1.8314667|
    |          1610|            A|18687642977982|-2.8260345| 4.180542|-3.2118988|
    +--------------+-------------+--------------+----------+---------+----------+
    only showing top 5 rows
    Command took 0.37 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 22:56:45 on vijay pawar's Cluster
    Cmd 13
    from pyspark.sql.functions import *
    columns = ['participant_id','activity_code','timestamp', 'x','y','z']
    for i in columns:
        print(i,raw_par_10_phone_accel.filter(raw_par_10_phone_accel[i].isNull()).count())

    (12) Spark Jobs

    participant_id 0
    activity_code 0
    timestamp 0
    x 0
    y 0
    z 0
    Command took 2.34 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 01:10:00 on vijay pawar's Cluster
    Cmd 14
    activity_codes_mapping = {'A': 'walking',
                              'B': 'jogging',
                              'C': 'stairs',
                              'D': 'sitting',
                              'E': 'standing',
                              'F': 'typing',
                              'G': 'brushing teeth',
                              'H': 'eating soup',
                              'I': 'eating chips',
                              'J': 'eating pasta',
                              'K': 'drinking from cup',
                              'L': 'eating sandwich',
                              'M': 'kicking soccer ball',
                              'O': 'playing catch tennis ball',
                              'P': 'dribbling basket ball',
                              'Q': 'writing',
                              'R': 'clapping',
                              'S': 'folding clothes'}
    Command took 0.10 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:26:42 on vijay pawar's Cluster
    Cmd 15
    def activity_codes_mapping_udf(code):
        return activity_codes_mapping.get(code, 'unknown')
    Command took 0.04 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 22:52:35 on vijay pawar's Cluster
    Cmd 16
    activity_color_map = {activity_codes_mapping['A']: 'lime',
                          activity_codes_mapping['B']: 'red',
                          activity_codes_mapping['C']: 'blue',
                          activity_codes_mapping['D']: 'orange',
                          activity_codes_mapping['E']: 'yellow',
                          activity_codes_mapping['F']: 'lightgreen',
                          activity_codes_mapping['G']: 'greenyellow',
                          activity_codes_mapping['H']: 'magenta',
                          activity_codes_mapping['I']: 'gold',
                          activity_codes_mapping['J']: 'cyan',
                          activity_codes_mapping['K']: 'purple',
                          activity_codes_mapping['L']: 'lightgreen',
                          activity_codes_mapping['M']: 'violet',
                          activity_codes_mapping['O']: 'limegreen',
                          activity_codes_mapping['P']: 'deepskyblue',   
                          activity_codes_mapping['Q']: 'mediumspringgreen',
                          activity_codes_mapping['R']: 'plum',
                          activity_codes_mapping['S']: 'olive'}
    Command took 0.09 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 22:20:07 on vijay pawar's Cluster
    Cmd 17
    column_data=raw_par_10_phone_accel.select('x')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 2.34 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 22:25:43 on vijay pawar's Cluster
    Cmd 18
    column_data=raw_par_10_phone_accel.select('y')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 2.36 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 22:20:08 on vijay pawar's Cluster
    Cmd 19
    column_data=raw_par_10_phone_accel.select('z')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 2.49 seconds -- by vijaypawar6677@gmail.com at 08/03/2023, 22:20:08 on vijay pawar's Cluster
    Cmd 20
    def show_accel_per_activity(device, df, act, interval_in_sec = None):
      ''' Plots acceleration time history per activity '''
    
      df1 = df.loc[df.activity_code == act].copy()
      df1.reset_index(drop = True, inplace = True)
      
      df1['duration'] = (df1['timestamp'] - df1['timestamp'].iloc[0])/1000000000 # nanoseconds --> seconds
      
      if interval_in_sec is None:
        ax = df1.plot(kind='line', x='duration', y=['x','y','z'], figsize=(25,7), grid = True)
      else:
        # raw data is sampled at ~20 Hz, so interval_in_sec seconds ~ interval_in_sec*20 rows
        ax = df1[:interval_in_sec*20].plot(kind='line', x='duration', y=['x','y','z'], figsize=(25,7), grid = True)
    
      ax.set_xlabel('duration  (sec)', fontsize = 15)
      ax.set_ylabel('acceleration  (m/sec^2)',fontsize = 15)
      ax.set_title('Acceleration:   Device: ' + device + '      Activity:  ' +activity_codes_mapping[act], fontsize = 15)
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:07:03 on vijay pawar's Cluster
    Cmd 21
    raw_par_10_phone_accel_pan=raw_par_10_phone_accel.toPandas()

    (1) Spark Jobs

    Command took 0.60 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:18:06 on vijay pawar's Cluster
    Cmd 22
    for key in activity_codes_mapping:
      show_accel_per_activity('Phone', raw_par_10_phone_accel_pan, key, 10)
    Command took 6.25 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 00:52:40 on vijay pawar's Cluster
    Cmd 23
    col = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z']
    raw_par_20_watch_accel = spark.read.format("csv") \
                 .option("header", "false") \
                 .option("inferSchema", "true") \
                 .option("delimiter", ",") \
                 .load("s3://humanactivity/wisdm-dataset/raw/watch/accel/data_1620_accel_watch.txt") \
                 .toDF(*col)
    
    display(raw_par_20_watch_accel)

    (3) Spark Jobs

    participant_id | activity_code | timestamp      | x           | y            | z
    1620           | A             | 37945621841814 | 3.4174237   | -2.1649568   | -4.849306;
    1620           | A             | 37945671341814 | 5.4237647   | -6.9366007   | -4.954651;
    1620           | A             | 37945720841814 | 4.7007155   | -3.127426    | -7.6481276;
    1620           | A             | 37945770341814 | 8.033444    | -4.7004166   | -5.0240827;
    1620           | A             | 37945819841814 | 3.9704843   | -1.6454151   | -3.6210804;
    1620           | A             | 37945869341814 | 5.689521    | -0.5847838   | -5.3137813;
    1620           | A             | 37945918841814 | 10.425252   | -4.5663414   | -10.746224;
    1620           | A             | 37945968341814 | 11.069292   | -8.002021    | -8.656087;
    1620           | A             | 37946017841814 | 6.1013236   | -2.8089972   | -4.6793175;
    1620           | A             | 37946067341814 | 7.9640126   | -10.422559   | -4.9139495;
    1620           | A             | 37946116841814 | 4.6528316   | -2.7970262   | -4.339341;
    1620           | A             | 37946166341814 | 4.4253826   | -1.7818847   | -3.8724716;
    1620           | A             | 37946215841814 | 5.2441998   | -4.1401734   | 0.027682956;
    1620           | A             | 37946265341814 | 8.203433    | -2.9167361   | -1.8613422;
    1620           | A             | 37946314341814 | 7.3223667   | -10.444106   | -9.324069;
    1620           | A             | 37946364341814 | 3.2977135   | -2.3517046   | -8.215553;
    1620           | A             | 37946413841814 | 6.278495    | -3.6182373   | -3.5995326;
    10,000 rows | Truncated data
    Command took 3.66 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:05:18 on vijay pawar's Cluster
    Cmd 24
    from pyspark.sql.functions import regexp_replace
    raw_par_20_watch_accel=raw_par_20_watch_accel.withColumn('z',regexp_replace('z',';',''))
    
    
    Command took 0.14 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:05:26 on vijay pawar's Cluster
    Cmd 25
    from pyspark.sql.types import DoubleType
    raw_par_20_watch_accel = raw_par_20_watch_accel.withColumn("z", col("z").cast(DoubleType()))
    Command took 0.12 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:05:53 on vijay pawar's Cluster
    Cmd 26
    raw_par_20_watch_accel.dtypes
    Out[24]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'bigint'), ('x', 'double'), ('y', 'double'), ('z', 'double')]
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:06:00 on vijay pawar's Cluster
    Cmd 27
    from pyspark.sql.functions import unix_timestamp
    raw_par_20_watch_accel = raw_par_20_watch_accel.withColumn('timestamp', col('timestamp').cast('timestamp'))
    Command took 0.12 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 19:57:18 on vijay pawar's Cluster
    Cmd 28
    raw_par_20_watch_accel.show(4)

    (1) Spark Jobs

    +--------------+-------------+--------------------+---------+----------+----------+
    |participant_id|activity_code|           timestamp|        x|         y|         z|
    +--------------+-------------+--------------------+---------+----------+----------+
    |          1620|            A|+35310-10-16 07:3...|3.4174237|-2.1649568| -4.849306|
    |          1620|            A|+35312-05-11 05:3...|5.4237647|-6.9366007| -4.954651|
    |          1620|            A|+35313-12-05 03:3...|4.7007155| -3.127426|-7.6481276|
    |          1620|            A|+35315-07-01 01:3...| 8.033444|-4.7004166|-5.0240827|
    +--------------+-------------+--------------------+---------+----------+----------+
    only showing top 4 rows
    Command took 0.97 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 19:57:30 on vijay pawar's Cluster
    Cmd 29
    raw_par_20_watch_accel.dtypes
    Out[12]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'timestamp'), ('x', 'double'), ('y', 'double'), ('z', 'double')]
    Command took 0.11 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 19:58:12 on vijay pawar's Cluster
    Cmd 30
    from pyspark.sql.functions import *
    columns = ['participant_id','activity_code','timestamp', 'x','y','z']
    for i in columns:
        print(i,raw_par_20_watch_accel.filter(raw_par_20_watch_accel[i].isNull()).count())

    (12) Spark Jobs

    participant_id 0
    activity_code 0
    timestamp 0
    x 0
    y 0
    z 0
    Command took 15.70 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 19:58:50 on vijay pawar's Cluster
    Cmd 31
    column_data=raw_par_20_watch_accel.select('x')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 4.99 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 19:59:32 on vijay pawar's Cluster
    Cmd 32
    column_data=raw_par_20_watch_accel.select('y')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 2.09 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:00:26 on vijay pawar's Cluster
    Cmd 33
    column_data=raw_par_20_watch_accel.select('z')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (3) Spark Jobs

    Command took 4.33 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:00:24 on vijay pawar's Cluster
    Cmd 34
    raw_par_20_watch_accel_pan=raw_par_20_watch_accel.toPandas()

    (1) Spark Jobs

    Command took 0.69 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:06:12 on vijay pawar's Cluster
    Cmd 35
    Cmd 36
    col = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z']
    raw_par_35_phone_gyro = spark.read.format("csv") \
                 .option("header", "false") \
                 .option("inferSchema", "true") \
                 .option("delimiter", ",") \
                 .load("s3://humanactivity/wisdm-dataset/raw/phone/gyro/data_1635_gyro_phone.txt") \
                 .toDF(*col)
    
    display(raw_par_35_phone_gyro)

    (3) Spark Jobs

    participant_id | activity_code | timestamp       | x            | y             | z
    1635           | A             | 676165107876419 | 0.87472534   | -0.098098755  | 0.8075104;
    1635           | A             | 676165158230423 | 0.5634308    | -1.3596649    | 1.4324188;
    1635           | A             | 676165208584427 | 1.5076904    | -3.0868835    | 0.6790161;
    1635           | A             | 676165258938431 | 1.430603     | -1.8603058    | 0.45103455;
    1635           | A             | 676165309292435 | 0.3251648    | -1.735611     | 0.2523651;
    1635           | A             | 676165359646439 | 1.0207672    | -1.1070251    | -0.47756958;
    1635           | A             | 676165410000443 | 0.17964172   | -1.3453369    | -0.36697388;
    1635           | A             | 676165460354447 | -0.44096375  | 0.11174011    | 0.4830017;
    1635           | A             | 676165510708451 | -0.30691528  | -0.27601624   | 0.12683105;
    1635           | A             | 676165561062455 | -0.08236694  | -1.8129578    | 0.4088745;
    1635           | A             | 676165611416459 | -0.30596924  | 3.1231995     | 2.746582E-4;
    1635           | A             | 676165661770462 | -1.5361023   | 3.7741547     | -0.5243988;
    1635           | A             | 676165712124466 | -3.51445     | -1.676651     | -0.8505249;
    1635           | A             | 676165762478470 | -3.2464447   | -1.2213898    | -0.9306946;
    1635           | A             | 676165812832474 | -1.2608948   | 1.4446106     | -0.20780945;
    1635           | A             | 676165863186478 | -1.0664368   | 1.1829987     | -0.40901184;
    1635           | A             | 676165913540482 | -1.4516144   | -0.79855347   | -0.5939636;
    10,000 rows | Truncated data
    Command took 1.57 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:16:11 on vijay pawar's Cluster
    Cmd 37
    raw_par_35_phone_gyro=raw_par_35_phone_gyro.withColumn('z',regexp_replace('z',';',''))
    
    
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:16:57 on vijay pawar's Cluster
    Cmd 38
    from pyspark.sql.functions import col
    raw_par_35_phone_gyro = raw_par_35_phone_gyro.withColumn("z", col("z").cast(DoubleType()))
    raw_par_35_phone_gyro.dtypes
    
    
    Out[39]: [('participant_id', 'int'), ('activity_code', 'string'), ('timestamp', 'bigint'), ('x', 'double'), ('y', 'double'), ('z', 'double')]
    Command took 0.16 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:18:38 on vijay pawar's Cluster
    Cmd 39
    Cmd 40
    column_data=raw_par_35_phone_gyro.select('x')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    import matplotlib.pyplot as plt
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()
    
    column_data=raw_par_35_phone_gyro.select('y')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()
    
    column_data=raw_par_35_phone_gyro.select('z')
    histogram = column_data.rdd.flatMap(lambda x: x).histogram(4)
    
    # plot the histogram
    plt.hist(column_data.rdd.flatMap(lambda x: x).collect(), bins=50, color='green')
    plt.show()

    (9) Spark Jobs

    Command took 4.92 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:25:23 on vijay pawar's Cluster
    Cmd 41
    raw_par_35_phone_gyro_pan=raw_par_35_phone_gyro.toPandas()

    (1) Spark Jobs

    Command took 0.51 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:26:32 on vijay pawar's Cluster
    Cmd 42
    def show_ang_velocity_per_activity(device, df, act, interval_in_sec = None):
      ''' Plots angular velocity time history per activity '''
    
      df1 = df.loc[df.activity_code == act].copy()
      df1.reset_index(drop = True, inplace = True)
    
      df1['duration'] = (df1['timestamp'] - df1['timestamp'].iloc[0])/1000000000 # nanoseconds --> seconds
    
      if interval_in_sec is None:
        ax = df1.plot(kind='line', x='duration', y=['x','y','z'], figsize=(25,7), grid = True)
      else:
        # raw data is sampled at ~20 Hz, so interval_in_sec seconds ~ interval_in_sec*20 rows
        ax = df1[:interval_in_sec*20].plot(kind='line', x='duration', y=['x','y','z'], figsize=(25,7), grid = True)
    
      ax.set_xlabel('duration  (sec)', fontsize = 15)
      ax.set_ylabel('angular velocity  (rad/sec)',fontsize = 15)
      ax.set_title('Angular velocity:  Device: ' + device + '      Activity:  ' +activity_codes_mapping[act] , fontsize = 15)
    Command took 0.08 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:32:16 on vijay pawar's Cluster
    Cmd 43
    for key in activity_codes_mapping:
      show_ang_velocity_per_activity('Phone', raw_par_35_phone_gyro_pan, key)
    Command took 7.79 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:37:00 on vijay pawar's Cluster
    Cmd 44
    col = ['participant_id' , 'activity_code' , 'timestamp', 'x', 'y', 'z']
    raw_par_35_watch_gyro = spark.read.format("csv") \
                 .option("header", "false") \
                 .option("inferSchema", "true") \
                 .option("delimiter", ",") \
                 .load("s3://humanactivity/wisdm-dataset/raw/watch/gyro/data_1635_gyro_watch.txt") \
                 .toDF(*col)
    
    display(raw_par_35_watch_gyro)

    (3) Spark Jobs

    participant_id | activity_code | timestamp      | x             | y              | z
    1635           | A             | 64768437041043 | 0.11791146    | -0.14554176    | -1.3008211;
    1635           | A             | 64768486541043 | 1.0382999     | -0.4555337     | -1.1346397;
    1635           | A             | 64768536041043 | 0.8763797     | -0.5418201     | -0.84488785;
    1635           | A             | 64768585541043 | 0.9125987     | -0.38842204    | -0.394281;
    1635           | A             | 64768635041043 | 0.048669267   | -0.046472162   | 0.2885535;
    1635           | A             | 64768684541043 | -0.3294996    | -0.016644757   | 1.1471566;
    1635           | A             | 64768734041043 | 0.58236676    | 0.07070693     | 1.4699317;
    1635           | A             | 64768783541043 | 0.26811373    | 0.4584632      | 2.4883246;
    1635           | A             | 64768833041043 | 1.6625448     | -0.23502396    | 3.2159002;
    1635           | A             | 64768882541043 | 2.5392575     | -0.14767228    | 3.6238964;
    1635           | A             | 64768932041043 | 2.1262734     | 0.41313323     | 3.6237595;
    1635           | A             | 64768981541043 | 1.1036196     | 0.73910415     | 3.2274811;
    1635           | A             | 64769031041043 | 0.020245617   | 1.2568227      | 2.5265372;
    1635           | A             | 64769080541043 | -1.2452885    | 1.4315261      | 1.1779124;
    1635           | A             | 64769130041043 | -2.1848516    | 1.241909       | -0.6852351;
    1635           | A             | 64769179541043 | -2.5907176    | 0.75934416     | -2.566492;
    1635           | A             | 64769229041043 | -1.3507497    | 0.00087591365  | -3.9630537;
    10,000 rows | Truncated data
    Command took 1.33 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:39:51 on vijay pawar's Cluster
    Cmd 45
    raw_par_35_watch_gyro=raw_par_35_watch_gyro.withColumn('z',regexp_replace('z',';',''))
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:44:23 on vijay pawar's Cluster
    Cmd 46
    Cmd 47
    Cmd 48
    Cmd 49
    raw_par_35_watch_gyro_pan=raw_par_35_watch_gyro.toPandas()

    (1) Spark Jobs

    Command took 1.06 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:44:42 on vijay pawar's Cluster
    Cmd 50
    for key in activity_codes_mapping:
      show_ang_velocity_per_activity('Watch', raw_par_35_watch_gyro_pan, key)
    Command took 7.70 seconds -- by vijaypawar6677@gmail.com at 09/03/2023, 20:44:46 on vijay pawar's Cluster
    Cmd 51
    Cmd 52
    Cmd 53
    Cmd 54
    Cmd 55
    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType
    for coli in all_phone_accel.columns:
        if coli == 'ACTIVITY':
            continue
        else:
            all_phone_accel = all_phone_accel.withColumn(coli, col(coli).cast(DoubleType()))
            
    Command took 4.65 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:10:51 on vijay pawar's Cluster
    Cmd 56
    Cmd 57
    Cmd 58
    import matplotlib.pyplot as plt
    import pandas as pd
    
    # Group by activity and count the rows
    activity_counts = all_phone_accel.groupBy('ACTIVITY') \
        .count() \
        .orderBy('count', ascending=False)
    
    # Convert to Pandas DataFrame
    activity_counts_pd = activity_counts.toPandas()
    
    # Plot the results
    _ = activity_counts_pd.plot(kind='bar', x='ACTIVITY', y='count',
                                figsize=(15, 5), color='purple',
                                title='row count per activity',
                                legend=True, fontsize=15)
    plt.show()

    (4) Spark Jobs

    Command took 5.42 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:18:36 on vijay pawar's Cluster
    Cmd 59
    # Group by activity and count the rows
    activity_counts = all_phone_accel.groupBy('PARTICIPANT') \
        .count() \
        .orderBy('count', ascending=False)
    
    # Convert to Pandas DataFrame
    activity_counts_pd = activity_counts.toPandas()
    
    # Plot the results
    _ = activity_counts_pd.plot(kind='bar', x='PARTICIPANT', y='count',
                                figsize=(15, 5), color='purple',
                                title='row count per PARTICIPANT',
                                legend=True, fontsize=15)
    plt.show()

    (4) Spark Jobs

    Command took 3.56 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:21:16 on vijay pawar's Cluster
    Cmd 60
    all_phone_accel[['XABSOLDEV', 'YABSOLDEV','ZABSOLDEV','XSTANDDEV', 'YSTANDDEV', 'ZSTANDDEV', 'XVAR', 'YVAR', 'ZVAR']].show(4)

    (1) Spark Jobs

    +---------+---------+---------+---------+---------+---------+--------+--------+--------+
    |XABSOLDEV|YABSOLDEV|ZABSOLDEV|XSTANDDEV|YSTANDDEV|ZSTANDDEV|    XVAR|    YVAR|    ZVAR|
    +---------+---------+---------+---------+---------+---------+--------+--------+--------+
    |  1.59095|  3.29508|  1.60941|  0.14117| 0.283329|   0.1598|0.375726|0.532286| 0.39975|
    |  1.77817|   3.3349|  1.68296| 0.161229| 0.287955| 0.157993|0.401533|0.536614|0.397483|
    |  1.70505|  3.14244|  1.69288|   0.1562| 0.269307| 0.159797|0.395221|0.518948|0.399746|
    |  1.62315|   3.3279|  1.57944| 0.141721| 0.283364| 0.154396|0.376459| 0.53232|0.392933|
    +---------+---------+---------+---------+---------+---------+--------+--------+--------+
    only showing top 4 rows
    Command took 0.60 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:23:13 on vijay pawar's Cluster
    Cmd 61
    all_phone_accel = all_phone_accel.drop('XSTANDDEV', 'YSTANDDEV', 'ZSTANDDEV', 'XVAR', 'YVAR', 'ZVAR')
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:27:45 on vijay pawar's Cluster
    Cmd 62
    Cmd 63
    all_phone_accel = all_phone_accel.drop('PARTICIPANT')
    Command took 0.16 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 20:32:36 on vijay pawar's Cluster
    Cmd 64
    Cmd 65
    X_train = spark.read.csv("s3://humanactivity/Train_test_spllit/X_train",header=True)
    X_test = spark.read.csv("s3://humanactivity/Train_test_spllit/X_test",header=True)
    y_train=spark.read.csv("s3://humanactivity/Train_test_spllit/y_train",header=True)
    y_test= spark.read.csv("s3://humanactivity/Train_test_spllit/y_test",header=True)
    Cmd 66
    Cmd 67
    Cmd 68
    Cmd 69
    Cmd 70
    X_test.count(),X_train.count()

    (4) Spark Jobs

    Out[52]: (5163, 15485)
    Command took 0.70 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 22:08:12 on vijay pawar's Cluster
    Cmd 71
    y_test.count(),y_train.count()

    (4) Spark Jobs

    Out[53]: (5163, 15485)
    Command took 0.47 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 22:08:41 on vijay pawar's Cluster
    Cmd 72
    par_23_df = spark.read.csv("s3://humanactivity/Cluster_diagram/par_23_df",header=True)

    (1) Spark Jobs

    Command took 0.67 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:26:04 on vijay pawar's Cluster
    Cmd 73
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    import pandas as pd
    
    # Convert the PySpark DataFrame to a Pandas DataFrame
    par_23_pd_df = par_23_df.toPandas()
    
    # Extract the features and the target variable from the Pandas DataFrame
    yy = par_23_pd_df['ACTIVITY']
    XX = par_23_pd_df.drop(['ACTIVITY','PARTICIPANT','ACT','XSTANDDEV','YSTANDDEV','ZSTANDDEV','XVAR','YVAR','ZVAR'], axis = 1)
    
    # Apply t-SNE to reduce the dimensionality of the feature space to 2 dimensions
    tsne = TSNE(n_components=2, random_state=300)
    X_2d = tsne.fit_transform(XX)
    
    # Plot the t-SNE visualization using Matplotlib
    target_ids = tuple(activity_codes_mapping.keys())
    
    plt.figure(figsize=(10, 10))
    colors = 'lime', 'red', 'blue', 'orange', 'yellow', 'lightgreen', 'greenyellow', 'magenta', 'gold', 'cyan', 'purple', 'lightgreen', 'violet', 'limegreen', 'deepskyblue', 'mediumspringgreen', 'plum', 'olive'
    
    for i, c, label in zip(target_ids, colors, tuple(activity_codes_mapping.values())):
        plt.scatter(X_2d[yy == i, 0], X_2d[yy == i, 1], c=c, label=label)
    
    plt.legend()
    plt.show()

    (1) Spark Jobs

    Command took 3.77 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:26:51 on vijay pawar's Cluster
    Cmd 74
    par_35_df = spark.read.csv("s3://humanactivity/Cluster_diagram/par_35_df",header=True)

    (1) Spark Jobs

    Command took 0.70 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:54:47 on vijay pawar's Cluster
    Cmd 75
    # Convert the PySpark DataFrame to a Pandas DataFrame
    par_35_pd_df = par_35_df.toPandas()
    
    # Extract the features and the target variable from the Pandas DataFrame
    yy = par_35_pd_df['ACTIVITY']
    XX = par_35_pd_df.drop(['ACTIVITY','PARTICIPANT','ACT','XSTANDDEV','YSTANDDEV','ZSTANDDEV','XVAR','YVAR','ZVAR'], axis = 1)
    
    # Apply t-SNE to reduce the dimensionality of the feature space to 2 dimensions
    tsne = TSNE(n_components=2, random_state=300)
    X_2d = tsne.fit_transform(XX)
    
    # Plot the t-SNE visualization using Matplotlib
    target_ids = tuple(activity_codes_mapping.keys())
    
    plt.figure(figsize=(10, 10))
    colors = 'lime', 'red', 'blue', 'orange', 'yellow', 'lightgreen', 'greenyellow', 'magenta', 'gold', 'cyan', 'purple', 'teal', 'violet', 'limegreen', 'deepskyblue', 'mediumspringgreen', 'plum', 'olive'
    
    for i, c, label in zip(target_ids, colors, tuple(activity_codes_mapping.values())):
        plt.scatter(X_2d[yy == i, 0], X_2d[yy == i, 1], c=c, label=label)
    
    plt.legend()
    plt.show()

    (1) Spark Jobs

    Command took 3.32 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:56:15 on vijay pawar's Cluster
    Cmd 76
    par_40_df = spark.read.csv("s3://humanactivity/Cluster_diagram/par_40_df",header=True)

    (1) Spark Jobs

    Command took 0.42 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:56:21 on vijay pawar's Cluster
    Cmd 77
    # Convert the PySpark DataFrame to a Pandas DataFrame
    par_40_pd_df = par_40_df.toPandas()
    
    # Extract the features and the target variable from the Pandas DataFrame
    yy = par_40_pd_df['ACTIVITY']
    XX = par_40_pd_df.drop(['ACTIVITY','PARTICIPANT','ACT','XSTANDDEV','YSTANDDEV','ZSTANDDEV','XVAR','YVAR','ZVAR'], axis = 1)
    
    # Apply t-SNE to reduce the dimensionality of the feature space to 2 dimensions
    tsne = TSNE(n_components=2, random_state=300)
    X_2d = tsne.fit_transform(XX)
    
    # Plot the t-SNE visualization using Matplotlib
    target_ids = tuple(activity_codes_mapping.keys())
    
    plt.figure(figsize=(10, 10))
    colors = 'lime', 'red', 'blue', 'orange', 'yellow', 'lightgreen', 'greenyellow', 'magenta', 'gold', 'cyan', 'purple', 'teal', 'violet', 'limegreen', 'deepskyblue', 'mediumspringgreen', 'plum', 'olive'
    
    for i, c, label in zip(target_ids, colors, tuple(activity_codes_mapping.values())):
        plt.scatter(X_2d[yy == i, 0], X_2d[yy == i, 1], c=c, label=label)
    
    plt.legend()
    plt.show()

    (1) Spark Jobs

    Command took 3.33 seconds -- by vijaypawar6677@gmail.com at 10/03/2023, 23:56:24 on vijay pawar's Cluster
    Cmd 78
    import pandas as pd
    import matplotlib.pyplot as plt
    from pyspark.sql.functions import count
    
    counts = y_train.groupBy('Y').agg(count('*').alias('count'))
    activity_counts_pd = counts.toPandas()
    
    activity_counts_pd.plot(kind='bar', x='Y', y='count', color='red', figsize=(15,5), legend=False, fontsize=15)
    plt.title('Row count per activity', fontsize=15)
    plt.show()

    (2) Spark Jobs

    Command took 0.74 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:08:18 on vijay pawar's Cluster
    Cmd 79
    counts = y_test.groupBy('ACTIVITY').agg(count('*').alias('count'))
    activity_counts_pd = counts.toPandas()
    
    activity_counts_pd.plot(kind='bar', x='ACTIVITY', y='count', color='red', figsize=(15,5), legend=False, fontsize=15)
    plt.title('Row count per activity', fontsize=15)
    plt.show()

    (2) Spark Jobs

    Command took 0.73 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:10:05 on vijay pawar's Cluster
    Cmd 80
    from sklearn.neighbors import KNeighborsClassifier
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:12:31 on vijay pawar's Cluster
    Cmd 81
    knn_classifier = KNeighborsClassifier()
    Command took 0.09 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:12:44 on vijay pawar's Cluster
    Cmd 82
    my_param_grid = {'n_neighbors': [5, 10, 20], 'leaf_size': [20, 30, 40]}
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:12:55 on vijay pawar's Cluster
    Cmd 83
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.model_selection import GroupKFold
    
    my_cv = StratifiedShuffleSplit(n_splits=5, train_size=0.7, test_size=0.3)
    Command took 0.05 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:13:55 on vijay pawar's Cluster
    Cmd 84
    from sklearn.model_selection import GridSearchCV
    knn_model_gs = GridSearchCV(estimator = knn_classifier, 
                                param_grid = my_param_grid,
                                cv = my_cv, 
                                scoring ='accuracy')
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:15:30 on vijay pawar's Cluster
    Cmd 85
    X_train_pan = X_train.toPandas()
    X_train_pan.shape

    (1) Spark Jobs

    Out[61]: (15485, 85)
    Command took 0.79 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:18:31 on vijay pawar's Cluster
    Cmd 86
    y_train_pan = y_train.toPandas()
    y_train_pan.shape

    (1) Spark Jobs

    Out[64]: (15485, 1)
    Command took 0.21 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:19:16 on vijay pawar's Cluster
    Cmd 87
    knn_model_gs.fit(X_train_pan, y_train_pan)
    /databricks/python/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:179: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel(). return self._fit(X, y)
    (warning repeated once per cross-validation fit; duplicates omitted)
    Out[65]: GridSearchCV(cv=StratifiedShuffleSplit(n_splits=5, random_state=None, test_size=0.3, train_size=0.7), estimator=KNeighborsClassifier(), param_grid={'leaf_size': [20, 30, 40], 'n_neighbors': [5, 10, 20]}, scoring='accuracy')
    Command took 1.30 minutes -- by vijaypawar6677@gmail.com at 11/03/2023, 00:19:32 on vijay pawar's Cluster
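    The DataConversionWarning above comes from passing the one-column y DataFrame to sklearn, which expects a 1-D array. A minimal sketch of the suggested fix, using toy stand-in data (the `X_demo`/`y_demo` names and shapes are hypothetical, not the real WISDM features):

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    # Toy stand-ins for X_train_pan / y_train_pan
    X_demo = pd.DataFrame(np.random.RandomState(0).rand(20, 3))
    y_demo = pd.DataFrame({'Y': ['A', 'B'] * 10})

    clf = KNeighborsClassifier(n_neighbors=3)
    # Passing a 1-D array instead of a one-column DataFrame avoids the warning,
    # as the message itself suggests: y_demo.values.ravel() has shape (n_samples,)
    clf.fit(X_demo, y_demo.values.ravel())
    print(clf.predict(X_demo).shape)  # (20,)
    ```

    The same `ravel()` call would apply to the `fit`, `cross_val_score`, and `accuracy_score` cells below.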
    Cmd 88
    knn_best_classifier = knn_model_gs.best_estimator_
    knn_best_classifier
    Out[67]: KNeighborsClassifier(leaf_size=20)
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:21:18 on vijay pawar's Cluster
    Cmd 89
    print(knn_model_gs.best_params_)
    {'leaf_size': 20, 'n_neighbors': 5}
    Command took 0.09 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:21:22 on vijay pawar's Cluster
    Cmd 90
    knn_model_gs.cv_results_
    Out[69]: {'mean_fit_time': array([0.21512594, 0.22158017, 0.21417394, 0.21585569, 0.21538305, 0.21365356, 0.21426501, 0.21886706, 0.21291003]), 'std_fit_time': array([0.00643368, 0.0089054 , 0.00182238, 0.00474893, 0.00261459, 0.00409957, 0.00411009, 0.00181726, 0.00246076]), 'mean_score_time': array([1.48293386, 1.48728318, 1.50197349, 1.49293113, 1.49982138, 1.50211878, 1.48973308, 1.50537038, 1.49796567]), 'std_score_time': array([0.06093053, 0.04335954, 0.06559606, 0.05221255, 0.04975976, 0.05890101, 0.06168007, 0.05555911, 0.06130784]), 'param_leaf_size': masked_array(data=[20, 20, 20, 30, 30, 30, 40, 40, 40], mask=[False, False, False, False, False, False, False, False, False], fill_value='?', dtype=object), 'param_n_neighbors': masked_array(data=[5, 10, 20, 5, 10, 20, 5, 10, 20], mask=[False, False, False, False, False, False, False, False, False], fill_value='?', dtype=object), 'params': [{'leaf_size': 20, 'n_neighbors': 5}, {'leaf_size': 20, 'n_neighbors': 10}, {'leaf_size': 20, 'n_neighbors': 20}, {'leaf_size': 30, 'n_neighbors': 5}, {'leaf_size': 30, 'n_neighbors': 10}, {'leaf_size': 30, 'n_neighbors': 20}, {'leaf_size': 40, 'n_neighbors': 5}, {'leaf_size': 40, 'n_neighbors': 10}, {'leaf_size': 40, 'n_neighbors': 20}], 'split0_test_score': array([0.73762376, 0.69586741, 0.62010331, 0.73762376, 0.69586741, 0.62010331, 0.73762376, 0.69586741, 0.62010331]), 'split1_test_score': array([0.74343521, 0.70275506, 0.62785192, 0.74343521, 0.70275506, 0.62785192, 0.74343521, 0.70275506, 0.62785192]), 'split2_test_score': array([0.73568661, 0.69888076, 0.63151098, 0.73568661, 0.69888076, 0.63151098, 0.73568661, 0.69888076, 0.63151098]), 'split3_test_score': array([0.73977615, 0.70253982, 0.6183814 , 0.73977615, 0.70253982, 0.6183814 , 0.73977615, 0.70253982, 0.6183814 ]), 'split4_test_score': array([0.7455876 , 0.70318554, 0.62763668, 0.7455876 , 0.70318554, 0.62763668, 0.7455876 , 0.70318554, 0.62763668]), 'mean_test_score': array([0.74042187, 0.70064572, 
0.62509686, 0.74042187, 0.70064572, 0.62509686, 0.74042187, 0.70064572, 0.62509686]), 'std_test_score': array([0.00364511, 0.00284376, 0.00500429, 0.00364511, 0.00284376, 0.00500429, 0.00364511, 0.00284376, 0.00500429]), 'rank_test_score': array([1, 4, 7, 1, 4, 7, 1, 4, 7], dtype=int32)}
    Command took 0.06 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:21:22 on vijay pawar's Cluster
    Cmd 91
    knn_best_classifier.get_params()
    Out[70]: {'algorithm': 'auto', 'leaf_size': 20, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
    Command took 0.11 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:21:27 on vijay pawar's Cluster
    Cmd 92
    scores = cross_val_score(knn_best_classifier, X_train_pan, y_train_pan, cv=my_cv, scoring='accuracy')
    list(scores)
    /databricks/python/lib/python3.9/site-packages/sklearn/neighbors/_classification.py:179: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel(). return self._fit(X, y)
    (warning repeated once per cross-validation split; duplicates omitted)
    Out[72]: [0.7477399913904433, 0.7470942746448558, 0.7294446835987947, 0.7354713732242789, 0.7488161859664227]
    Command took 8.74 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:21:46 on vijay pawar's Cluster
    Cmd 93
    y_train_pred=knn_best_classifier.predict(X_train_pan)
    Command took 6.57 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:22:23 on vijay pawar's Cluster
    Cmd 94
    from sklearn.metrics import accuracy_score
    accuracy_score(y_true=y_train_pan, y_pred=y_train_pred)
    Out[76]: 0.8407491120439134
    Command took 0.08 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:23:27 on vijay pawar's Cluster
    Cmd 95
    X_test_pan = X_test.toPandas()
    y_test_pred = knn_best_classifier.predict(X_test_pan)

    (1) Spark Jobs

    Command took 5.46 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:24:22 on vijay pawar's Cluster
    Cmd 96
    y_test_pan = y_test.toPandas()

    (1) Spark Jobs

    Command took 0.24 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:25:46 on vijay pawar's Cluster
    Cmd 97
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_true=y_test_pan,
                          y_pred=y_test_pred)
        
    cm_act = pd.DataFrame(cm,
                          index = knn_best_classifier.classes_,
                          columns = knn_best_classifier.classes_)
    
    cm_act.columns = activity_codes_mapping.values()
    cm_act.index = activity_codes_mapping.values()
    cm_act
    Command took 0.15 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:25:48 on vijay pawar's Cluster
    Cmd 98
    import seaborn as sns
    sns.set(font_scale=1.6)
    fig, ax = plt.subplots(figsize=(12,10))
    _ = sns.heatmap(cm_act, cmap="YlGnBu")
    Command took 0.91 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:26:19 on vijay pawar's Cluster
    Cmd 99
    import numpy as np
    accuracy_per_activity = pd.DataFrame([cm_act.iloc[i][i]/np.sum(cm_act.iloc[i]) for i in range(18)],index=activity_codes_mapping.values())
    accuracy_per_activity
    Command took 0.10 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:28:26 on vijay pawar's Cluster
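    The per-activity accuracy computed above (diagonal over row sum) is per-class recall; sklearn can produce the same values directly with `confusion_matrix(..., normalize='true')`. A small sketch with toy labels (standing in for `y_test_pan` / `y_test_pred`):

    ```python
    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Toy labels standing in for y_test_pan / y_test_pred
    y_true = ['A', 'A', 'B', 'B', 'B', 'C']
    y_pred = ['A', 'B', 'B', 'B', 'C', 'C']

    # normalize='true' divides each row by its row sum,
    # so the diagonal holds per-class recall
    cm_norm = confusion_matrix(y_true, y_pred, normalize='true')
    print(np.diag(cm_norm))  # [0.5        0.66666667 1.        ]
    ```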
    Cmd 100
    from sklearn.metrics import classification_report
    print(classification_report(y_true=y_test_pan,
                                y_pred=y_test_pred))

                  precision    recall  f1-score   support

               A       0.83      0.97      0.90       292
               B       0.94      0.94      0.94       299
               C       0.82      0.87      0.84       268
               D       0.71      0.71      0.71       289
               E       0.77      0.79      0.78       294
               F       0.69      0.74      0.71       268
               G       0.73      0.83      0.78       294
               H       0.65      0.65      0.65       286
               I       0.69      0.62      0.65       282
               J       0.71      0.69      0.70       269
               K       0.62      0.56      0.59       301
               L       0.69      0.65      0.67       283
               M       0.78      0.82      0.80       299
               O       0.77      0.75      0.76       290
               P       0.82      0.81      0.81       287
               Q       0.76      0.76      0.76       283
               R       0.85      0.75      0.80       291
               S       0.80      0.78      0.79       288

        accuracy                           0.76      5163
       macro avg       0.76      0.76      0.76      5163
    weighted avg       0.76      0.76      0.76      5163
    Command took 0.20 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:29:35 on vijay pawar's Cluster
    Cmd 101
    accuracy_score(y_true = y_test_pan, y_pred = y_test_pred)
    Out[90]: 0.7594421847762929
    Command took 0.07 seconds -- by vijaypawar6677@gmail.com at 11/03/2023, 00:30:13 on vijay pawar's Cluster
    Cmd 102

    Summary of mean prediction accuracy scores per model:

    Decision Tree: 0.33

    Random Forest: 0.44

    Logistic Regression: 0.38

    KNN: 0.75
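
    The model comparison above can be visualized the same way as the earlier activity-count plots. A minimal sketch (the scores are copied from the summary; the plot styling is arbitrary):

    ```python
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend; not needed inside Databricks
    import matplotlib.pyplot as plt

    # Mean test accuracy per model, from the summary above
    model_scores = {'Decision Tree': 0.33, 'Random Forest': 0.44,
                    'Logistic Regression': 0.38, 'KNN': 0.75}

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(model_scores.keys(), model_scores.values(), color='red')
    ax.set_ylabel('Mean test accuracy')
    ax.set_title('Model comparison')
    plt.tight_layout()
    plt.show()
    ```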

    Cmd 103

    Insights & Conclusions:

    1. The K-Nearest Neighbors classifier proved to provide substantially higher prediction accuracy than the rest of the models in this case (overall mean accuracy ~0.75 on the test set).

    2. The analysis demonstrated the typical variation in detection accuracy across the various physical activities.
